In this project, I analyze the correlations among the different factors and try to find which chemical properties influence the quality of red wines.
Let’s focus on each factor and see what we have.
## [1] 1599
## [1] 13
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
From this summary, you can see there are 1599 observations and 13 variables in our file. Now, let’s see some relations in the graphs.
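The inspection output above is the kind of result that `nrow()`, `ncol()`, `str()` and `summary()` produce. As a minimal sketch, the same calls are shown here on a ten-row sample whose values are copied from the `str()` output above. The name `wine10` and the commented-out file name are assumptions made for illustration; they are not taken from the original analysis.

```r
# Hypothetical loading step (the file name is an assumption):
# wine <- read.csv("wineQualityReds.csv")

# Ten-row sample: the first ten values of five columns, as printed by str() above.
wine10 <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

print(nrow(wine10))     # number of observations in the sample
print(ncol(wine10))     # number of variables in the sample
str(wine10)             # type and first values of each column
print(summary(wine10))  # min, quartiles, mean and max per column
```

On the full data set the same calls would report 1599 observations and 13 variables; the sample is only there to make the sketch runnable.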
From this graph, we can see that wine quality is concentrated at levels 5 and 6.
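A frequency table is a quick way to back up a statement like this with numbers. The sketch below counts the quality scores of the first ten wines (values copied from the `str()` output earlier); it is an illustration on a small sample, not the distribution of all 1599 wines.

```r
# Quality scores of the first ten wines, as printed by str() earlier.
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

counts <- table(quality)       # how many wines received each score
shares <- prop.table(counts)   # the same counts as proportions

print(counts)
print(round(shares, 2))
```

Applied to the full `quality` column, the same two calls would show how strongly scores 5 and 6 dominate.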
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
The alcohol level of the red wine is concentrated around 9.5. The median alcohol level is 10.2 and the mean is 10.42.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9967 0.9978 1.0037
The density is very close to 1, with a mean of 0.9967, and there is only a very small difference between the minimum (0.9901) and the maximum (1.0037).
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
Most sulfates levels are low, around 0.6, and the mean is 0.66. Red wine generally contains very little sulfates.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
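The curious thing is that the mean fixed.acidity (8.32) is almost 16 times the mean volatile.acidity (0.53). In both cases the mean is above the median and the maximum lies far from the third quartile, which suggests that both are right skewed. The two may be related to each other, which I look at later.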
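The comparison of the two acidity measures can be checked with a little arithmetic on the summary values printed above. This is a rough sketch: comparing mean and median is only a heuristic for the direction of skew, and `skew_hint` is a hypothetical helper written for this illustration.

```r
# Summary values copied from the output above.
fixed_mean      <- 8.32
fixed_median    <- 7.90
volatile_mean   <- 0.5278
volatile_median <- 0.5200

# How many times larger is the mean fixed acidity than the mean volatile acidity?
ratio <- fixed_mean / volatile_mean
print(round(ratio, 1))

# Heuristic: a mean above the median hints at a longer right (upper) tail.
skew_hint <- function(mean_value, median_value) {
  if (mean_value > median_value) {
    "right"
  } else if (mean_value < median_value) {
    "left"
  } else {
    "symmetric"
  }
}

print(skew_hint(fixed_mean, fixed_median))
print(skew_hint(volatile_mean, volatile_median))
```

A histogram of each column would be the more reliable check; the heuristic only uses numbers that are already in the summary.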
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
The red wine contains very little citric.acid; the mean is 0.27. But there is a large gap between the minimum (0) and the maximum (1), so it may be affected by other factors or may itself affect the quality.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
The residual sugar level of the red wine is generally very low, but there is still a large gap between the sweetest wine (15.5) and the least sweet wine (0.9).
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
The wines contain very little chlorides, so it might not be the factor that affects the quality most.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 14.00 15.87 21.00 72.00
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 22.00 38.00 46.47 62.00 289.00
The maximum total.sulfur.dioxide reaches 289, which looks very high for a red wine, and the maximum free.sulfur.dioxide (72) is high as well. These two could be factors that affect the quality.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
The average pH of red wine is 3.3, which is quite acidic.
##
## Pearson's product-moment correlation
##
## data: wine$quality and wine$alcohol
## t = 21.639, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4373540 0.5132081
## sample estimates:
## cor
## 0.4761663
correlation between quality and alcohol is 0.476
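The t statistic and the confidence interval in the `cor.test` output follow directly from the correlation and the sample size. The sketch below recomputes them from the printed values using the standard formulas for Pearson's correlation, t = r * sqrt(n - 2) / sqrt(1 - r^2) and the Fisher z transformation for the interval. The variable names are chosen for this illustration.

```r
# Values copied from the cor.test output for quality and alcohol.
r <- 0.4761663   # sample correlation
n <- 1599        # number of observations
deg_free <- n - 2

# t statistic for testing "true correlation is 0".
t_stat <- r * sqrt(deg_free) / sqrt(1 - r^2)

# 95 percent confidence interval via the Fisher z transformation.
z  <- atanh(r)
se <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

print(round(t_stat, 3))
print(round(ci, 4))
```

The results agree with the printed `t = 21.639` and the interval from about 0.437 to 0.513, which is a useful sanity check when reading the remaining tests.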
##
## Pearson's product-moment correlation
##
## data: wine$quality and wine$density
## t = -7.0997, df = 1597, p-value = 1.875e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2220365 -0.1269870
## sample estimates:
## cor
## -0.1749192
correlation between quality and density is -0.175
##
## Pearson's product-moment correlation
##
## data: wine$quality and wine$sulphates
## t = 10.38, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2049011 0.2967610
## sample estimates:
## cor
## 0.2513971
correlation between quality and sulfates is 0.251
##
## Pearson's product-moment correlation
##
## data: wine$quality and wine$fixed.acidity
## t = 4.996, df = 1597, p-value = 6.496e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.07548957 0.17202667
## sample estimates:
## cor
## 0.1240516
correlation between quality and fixed.acidity is 0.124
##
## Pearson's product-moment correlation
##
## data: wine$quality and wine$volatile.acidity
## t = -16.954, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4313210 -0.3482032
## sample estimates:
## cor
## -0.3905578
correlation between quality and volatile.acidity is -0.391
##
## Pearson's product-moment correlation
##
## data: wine$quality and wine$citric.acid
## t = 9.2875, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1793415 0.2723711
## sample estimates:
## cor
## 0.2263725
correlation between quality and citric.acid is 0.226
##
## Pearson's product-moment correlation
##
## data: wine$quality and wine$residual.sugar
## t = 0.5488, df = 1597, p-value = 0.5832
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.03531327 0.06271056
## sample estimates:
## cor
## 0.01373164
correlation between quality and residual.sugar is 0.014
##
## Pearson's product-moment correlation
##
## data: wine$quality and wine$chlorides
## t = -5.1948, df = 1597, p-value = 2.313e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.17681041 -0.08039344
## sample estimates:
## cor
## -0.1289066
correlation between quality and chlorides is -0.129
##
## Pearson's product-moment correlation
##
## data: wine$quality and wine$free.sulfur.dioxide
## t = -2.0269, df = 1597, p-value = 0.04283
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.099430290 -0.001638987
## sample estimates:
## cor
## -0.05065606
correlation between quality and free.sulfur.dioxide is -0.051
##
## Pearson's product-moment correlation
##
## data: wine$quality and wine$total.sulfur.dioxid
## t = -7.5271, df = 1597, p-value = 8.622e-14
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2320162 -0.1373252
## sample estimates:
## cor
## -0.1851003
correlation between quality and total.sulfur.dioxide is -0.185
##
## Pearson's product-moment correlation
##
## data: wine$quality and wine$pH
## t = -2.3109, df = 1597, p-value = 0.02096
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.106451268 -0.008734972
## sample estimates:
## cor
## -0.05773139
correlation between quality and pH is -0.058
This data set contains 1599 observations and 13 variables. Apart from the row index X and the quality column, the remaining 11 variables are chemical properties that could affect red wine quality.
The quality of the red wine is the main point of this analysis, but most of the wines in the data set are of middling quality. I have looked at each factor's distribution, mean, maximum and minimum, and the correlation tests above show that alcohol, volatile.acidity, citric.acid, sulfates, total.sulfur.dioxide, density and chlorides have relatively strong relationships with quality. I will not analyze the relationship between quality and pH, residual sugar or free.sulfur.dioxide further, since their correlations are almost 0.
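The selection of factors can be made explicit by ranking the correlations reported above by absolute value. In this sketch the coefficients are copied from the `cor.test` outputs and rounded to three decimals; the cut-off of 0.1 for "almost 0" is an assumption made for the illustration, not a threshold stated in the analysis.

```r
# Correlations with quality, copied from the cor.test outputs above.
quality_cor <- c(
  alcohol              =  0.476,
  density              = -0.175,
  sulphates            =  0.251,
  fixed.acidity        =  0.124,
  volatile.acidity     = -0.391,
  citric.acid          =  0.226,
  residual.sugar       =  0.014,
  chlorides            = -0.129,
  free.sulfur.dioxide  = -0.051,
  total.sulfur.dioxide = -0.185,
  pH                   = -0.058
)

# Strongest relationships first, regardless of sign.
ranked <- quality_cor[order(abs(quality_cor), decreasing = TRUE)]
print(ranked)

# Factors whose correlation is close to 0 (assumed cut-off: |r| < 0.1).
near_zero <- names(quality_cor)[abs(quality_cor) < 0.1]
print(near_zero)
```

With this cut-off the factors flagged as near zero are exactly pH, residual.sugar and free.sulfur.dioxide, the three set aside above.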
Besides alcohol and volatile.acidity, the main factors, I think citric.acid, sulfates, total.sulfur.dioxide, density and chlorides might also help support the investigation.
I did not; so far I do not think it is necessary. I might create one later.
It is easy to see that there is a positive correlation between alcohol and quality: red wines with more alcohol tend to have higher quality.
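One way to put numbers next to a plot like this is to average the alcohol level within each quality score. The sketch below does that with `aggregate()` on the first ten wines (values copied from the `str()` output earlier). With only ten rows the result is purely illustrative and should not be read as the trend in the full data set; `wine10` is a name chosen for this example.

```r
# First ten wines: alcohol and quality, as printed by str() earlier.
wine10 <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Mean alcohol level for each quality score.
by_quality <- aggregate(alcohol ~ quality, data = wine10, FUN = mean)
print(by_quality)
```

Running the same call on the full data frame would give one mean per quality score from 3 to 8, which can then be compared with the positive correlation reported above.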
It is also clear that volatile.acidity has a negative relationship with quality (r = -0.391): the higher the volatile.acidity, the lower the quality of the red wine.
The correlation between sulfates and quality is weaker than for the previous two factors, but they still have a positive relationship: the higher the sulfates, the better the quality of the wine.
The same conclusion holds here: citric.acid has a positive relationship with quality.
As these graphs and analyses show, alcohol, volatile.acidity and citric.acid have relatively strong relationships with quality. The more alcohol the wine contains, the better its quality, and the more volatile acidity the wine has, the worse its quality.
Compared with these factors, sulfates, total.sulfur.dioxide, density and chlorides have weaker relationships with quality.
The strongest relationships I found with quality are those of alcohol and volatile.acidity.
Alcohol and volatile.acidity seem to have the strongest relationships with quality, but I would still like to go deeper and see whether other factors affect quality more. I am also interested in what influences alcohol and volatile.acidity, so I will look at the relationships of the other factors with these two.
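A small helper can compute the correlation of one chosen variable with every other column, so each factor does not need its own `cor.test` call. The sketch below is one possible way to do this; `cor_with` and `wine10` are hypothetical names, and the ten rows are the first ten wines from the `str()` output earlier, so the coefficients will not match those of the full data set.

```r
# First ten wines, values copied from the str() output earlier.
wine10 <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Pearson correlation of `target` with every other column,
# sorted so that the strongest relationships come first.
cor_with <- function(data, target) {
  others <- setdiff(names(data), target)
  values <- sapply(data[others], function(column) cor(column, data[[target]]))
  values[order(abs(values), decreasing = TRUE)]
}

volatile_cor <- cor_with(wine10, "volatile.acidity")
print(round(volatile_cor, 3))
```

Calling `cor_with()` with `"alcohol"` as the target, on the full data frame, would give the corresponding ranking for alcohol.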
There is clearly a strong negative relationship between the density of the red wine and its alcohol level. The high-quality red wines are concentrated in the low-density, high-alcohol area.
Weak relationship.
Weak relationship.
Weak relationship.
Almost no relationship.
Weak relationship.
So far, the strongest relationship of alcohol is with the density of the red wine.
The two acidity measures have quite a strong negative correlation: higher fixed.acidity goes with lower volatile.acidity. Both also relate to quality, and the better-quality wines are concentrated in the high fixed.acidity, low volatile.acidity area.
No relationship
There is some relationship, but it is not very strong.
These factors also seem to have quite a strong relationship. The higher the citric acid, the lower the volatile acidity. The higher-quality wines tend to be in the high citric acid, low volatile acidity area.
No relationship
No relationship
So far, the strongest relationships of volatile.acidity are with citric.acid and fixed.acidity.
Yes, definitely. As analyzed in this part, alcohol has a strong relationship with the density of the red wine, and volatile.acidity has strong relationships with citric.acid and fixed.acidity.
Yes. At the beginning I thought alcohol and volatile.acidity were the only factors that strongly affect wine quality, but after plotting the relationships of the other factors with alcohol and volatile.acidity, I found that many factors are related to each other and that together they can affect wine quality even more.
As we can see from these graphs, alcohol clearly has the strongest positive relationship with quality. The more alcohol the wine has, the better its quality.
As we can see from these graphs, volatile.acidity clearly has the strongest negative relationship with quality. The less volatile.acidity the wine has, the better its quality.
We can also see that density is the factor most strongly related to alcohol. Taken together, the two tell us more about wine quality.
The better wines have higher alcohol and lower density.
Fixed.acidity is one of the factors most strongly associated with the level of volatile.acidity, and its correlation with volatile.acidity is negative: the more volatile.acidity, the lower the fixed.acidity.
It also relates to wine quality: the good-quality wines have low volatile.acidity and high fixed.acidity.
Citric acid is the second strongest factor associated with the level of volatile.acidity, and its correlation with volatile.acidity is also negative: the more volatile.acidity, the less citric acid.
It also relates to wine quality: the good-quality wines have low volatile.acidity and high fixed.acidity and citric acid.
I started by analyzing the quality distribution to see where most of the wine samples are and which quality they have. Then I thought about what the relationship between quality and the other factors might be. I first looked at the relationships directly in a quick and easy way, and then I used cor.test. I excluded the few factors that have almost no relationship with quality so that I could focus on the ones with stronger relationships. Then, step by step, I found further factors that can influence quality.
In this analysis I used a lot of my knowledge of R and also tried to work out by myself how to start, how to plot, how to analyze and what my point was. That makes me feel really good!
My struggles were at the beginning: how to start the project, what the main point of the analysis was, and what I wanted to achieve. Later I figured out that the main point is the quality and how the other factors affect it.
For future work, I would ask even more questions about the data set. For example, which factors do not affect wine quality? What would be the perfect wine for health seekers? What levels of the chemical factors would a wine with a quality score of 10 have? I think there will be a chance to dive deeper into those questions in the future.